9 research outputs found

    Automatic scheduling of image processing pipelines

    Efficient Tensor Cores Support in TVM for Low-Latency Deep Learning

    Deep learning algorithms are gaining popularity in autonomous systems. These systems typically have stringent latency constraints that are challenging to meet given the high computational demands of these algorithms. Nvidia introduced Tensor Cores (TCs) to speed up some of the most commonly used operations in deep learning algorithms. Compilers (e.g., TVM) and libraries (e.g., cuDNN) focus on the efficient usage of TCs when performing batch processing. Latency-sensitive applications, however, cannot exploit large batch processing. This paper presents an extension to the TVM compiler that generates low-latency TC implementations, particularly for batch size 1. Experimental results show that our solution reduces latency on average by 14% compared to the cuDNN library on a desktop RTX 2070 GPU, and by 49% on an embedded Jetson Xavier GPU.

    Loop transformations leveraging hardware prefetching

    Memory-bound applications depend heavily on the memory bandwidth of the system in order to achieve high performance. Improving temporal and/or spatial locality through loop transformations is a common way of mitigating this dependency. However, choosing the right combination of optimizations is not a trivial task, because most of them alter the memory access pattern of the application and as a result interfere with the efficiency of the hardware prefetching mechanisms present in modern architectures. We propose an optimization algorithm that analytically classifies an algorithmic description of a loop nest in order to decide whether it should be optimized for temporal or for spatial locality, while also taking hardware prefetching into account. We implement our technique as a tool to be used with the Halide compiler and test it on a variety of benchmarks. We find an average performance improvement of over 40% compared to previous analytical models targeting the Halide language and compiler.
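
    To make the locality trade-off concrete, the sketch below (plain C++; not the paper's classification algorithm, and the sizes are illustrative) contrasts a strided loop nest with its interchanged form, which turns the inner accesses into a unit-stride stream that hardware prefetchers follow well.

        #include <cstdio>
        #include <vector>

        // Column sums of a row-major n x n matrix, two ways.
        void strided(const std::vector<float>& a, std::vector<float>& s, int n) {
            for (int j = 0; j < n; ++j)        // outer loop over columns:
                for (int i = 0; i < n; ++i)    // inner accesses stride by n floats,
                    s[j] += a[i * n + j];      // which defeats the HW prefetcher
        }

        void interchanged(const std::vector<float>& a, std::vector<float>& s, int n) {
            for (int i = 0; i < n; ++i)        // after interchange the inner loop
                for (int j = 0; j < n; ++j)    // is unit-stride, a stream the HW
                    s[j] += a[i * n + j];      // prefetcher handles easily
        }

        int main() {
            const int n = 2048;
            std::vector<float> a(n * n, 1.f), s(n, 0.f);
            interchanged(a, s, n);             // same result as strided(), better locality
            std::printf("%f\n", s[0]);
            return 0;
        }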

    Schedule Synthesis for Halide Pipelines on GPUs

    The Halide DSL and compiler have enabled high-performance code generation for image processing pipelines targeting heterogeneous architectures through the separation of algorithmic description and optimization schedule. However, automatic schedule generation is currently only possible for multi-core CPU architectures, so expert knowledge is still required when optimizing for platforms with GPU capabilities. In this work, we extend the current Halide Autoscheduler with novel optimization passes to efficiently generate schedules for CUDA-based GPU architectures. We evaluate our proposed method across a variety of applications and show that it achieves performance competitive with, and in many cases better than, that of manually tuned Halide schedules. Experimental results show that our schedules are on average 10% faster than manual schedules and over 2× faster than previous autoscheduling attempts.
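
    For context, the kind of hand-written CUDA schedule such an autoscheduler aims to synthesize might look like the following minimal Halide sketch (the blur pipeline and tile sizes are illustrative, not taken from the paper).

        #include "Halide.h"
        using namespace Halide;

        int main() {
            ImageParam in(Float(32), 2);
            Func blur_x("blur_x"), blur_y("blur_y");
            Var x("x"), y("y"), xo, yo, xi, yi;

            blur_x(x, y) = (in(x, y) + in(x + 1, y) + in(x + 2, y)) / 3.f;
            blur_y(x, y) = (blur_x(x, y) + blur_x(x, y + 1) + blur_x(x, y + 2)) / 3.f;

            // Manual CUDA schedule: map 16x16 tiles to thread blocks and stage
            // the producer per block (placing it in GPU shared memory).
            blur_y.gpu_tile(x, y, xo, yo, xi, yi, 16, 16);
            blur_x.compute_at(blur_y, xo).gpu_threads(x, y);

            // Requires a CUDA-capable device at JIT time.
            blur_y.compile_jit(get_host_target().with_feature(Target::CUDA));
            return 0;
        }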

    ConvFusion: A Model for Layer Fusion in Convolutional Neural Networks

    The superior accuracy and appealing universality of convolutional neural networks (CNNs) as a generic algorithm for many classification tasks have made the design of energy-efficient CNN accelerators an important topic in both academia and industry. Of particular interest in the design and use of CNN accelerators is the scheduling of the computational workload, which can have a major impact on the quality of the final design. However, the many inherently independent operations in CNNs result in a vast scheduling space, rendering the selection of the optimal schedule(s) non-trivial. To aid in this complex task, this work introduces a generic mathematical cost model of the external memory accesses, internal memory footprint, and compute load of CNN execution schedules. The model enables fast exploration of the scheduling space, including loop tiling, loop reordering, explicit data transfer scheduling, recomputation, and, crucially, layer fusion, which has recently attracted interest as a method to reduce external memory accesses. An accompanying open-source tool is released to perform schedule-space exploration for CNNs using the introduced cost model. Leveraging the code generation capabilities of this tool, the proposed model is validated on six real-world networks, demonstrating that layer fusion can reduce external memory accesses by more than two orders of magnitude compared to the best non-fused schedules. Perhaps counterintuitively, however, a high-level energy analysis shows that the practical benefits of layer fusion may be overestimated if other parts of the system are not tuned accordingly.
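
    As a rough illustration of why layer fusion cuts external traffic (this is a loop-level sketch, not ConvFusion's cost model; the layers and sizes are stand-ins), the C++ below contrasts an unfused two-layer loop nest, which materializes the whole intermediate in external memory, with a tile-wise fused version whose intermediate fits on chip.

        #include <cstdio>
        #include <vector>

        constexpr int H = 128, W = 128, T = 16;  // illustrative sizes; T divides H and W

        // Stand-ins for two CNN layers (element-wise here for brevity).
        static float layer1(int y, int x) { return 0.5f * (y + x); }
        static float layer2(float b) { return b * b; }

        // Unfused: the full intermediate B is written to and read back from
        // external memory between the two layer loops.
        void unfused(std::vector<float>& B, std::vector<float>& C) {
            for (int y = 0; y < H; ++y)
                for (int x = 0; x < W; ++x)
                    B[y * W + x] = layer1(y, x);
            for (int y = 0; y < H; ++y)
                for (int x = 0; x < W; ++x)
                    C[y * W + x] = layer2(B[y * W + x]);
        }

        // Fused: only a T x T tile of the intermediate exists at any time,
        // small enough for on-chip memory, so B never touches DRAM.
        void fused(std::vector<float>& C) {
            float tileB[T][T];
            for (int ty = 0; ty < H; ty += T)
                for (int tx = 0; tx < W; tx += T) {
                    for (int y = 0; y < T; ++y)
                        for (int x = 0; x < T; ++x)
                            tileB[y][x] = layer1(ty + y, tx + x);
                    for (int y = 0; y < T; ++y)
                        for (int x = 0; x < T; ++x)
                            C[(ty + y) * W + tx + x] = layer2(tileB[y][x]);
                }
        }

        int main() {
            std::vector<float> B(H * W), C1(H * W), C2(H * W);
            unfused(B, C1);
            fused(C2);
            std::printf("%d\n", C1 == C2);  // same results, very different DRAM traffic
            return 0;
        }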

    Programming tensor cores from an image processing DSL

    Tensor Core Units (TCUs) are specialized units, first introduced by NVIDIA in the Volta microarchitecture, that accelerate matrix multiplications for deep learning and linear algebra workloads. While these units have proved capable of providing significant speedups for specific applications, their programmability remains difficult for the average user. In this paper, we extend the Halide DSL and compiler with the ability to utilize these units when generating code for CUDA-based NVIDIA GPGPUs. To this end, we introduce a new scheduling directive along with custom lowering passes that automatically transform a Halide AST so that code can be generated for the TCUs. We evaluate the generated code and show that it can achieve over 5× speedup compared to manual Halide schedules without TCU support, while remaining within 20% of the NVIDIA cuBLAS implementations for mixed-precision GEMM and within 10% of manual CUDA implementations with WMMA intrinsics.
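
    A minimal sketch of the algorithm side such a directive would target follows: a mixed-precision GEMM in Halide (fp16 inputs, fp32 accumulation, the pattern WMMA is built for). The tensor_core directive name shown in the comment is hypothetical; the paper's actual scheduling primitive is not reproduced here.

        #include "Halide.h"
        using namespace Halide;

        int main() {
            ImageParam A(Float(16), 2), B(Float(16), 2);
            Var x("x"), y("y");
            RDom k(0, 1024);

            Func matmul("matmul");
            matmul(x, y) = 0.f;                                   // fp32 accumulator
            matmul(x, y) += cast<float>(A(k, y)) * cast<float>(B(x, k));

            // Hypothetical directive name, for illustration only; this is where a
            // scheduling primitive could mark the update for lowering to TCU fragments:
            // matmul.update().tensor_core(x, y, k);

            matmul.compile_jit(get_host_target().with_feature(Target::CUDA));
            return 0;
        }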

    Schedule synthesis for Halide pipelines through reuse analysis

    Efficient code generation for image processing applications continues to pose a challenge in a domain where high performance is often necessary to meet real-time constraints. The inherently complex structure found in most image processing pipelines, the plethora of transformations that can be applied to optimize the performance of an implementation, and the interaction of these optimizations with locality, redundant computation, and parallelism can be identified as the key reasons behind this challenge. Recent domain-specific languages (DSLs) such as the Halide DSL and compiler attempt to encourage high-level design-space exploration to facilitate the optimization process. We propose a novel optimization strategy that aims to maximize producer-consumer locality by exploiting reuse in image processing pipelines. We implement our analysis as a tool that can be used alongside the Halide DSL to automatically generate schedules for pipelines implemented in Halide, and test it on a variety of benchmarks. Experimental results on three different multi-core architectures show an average performance improvement of 40% over the Halide Auto-Scheduler and 75% over a state-of-the-art approach that targets the PolyMage DSL.
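
    The producer-consumer locality such an analysis exploits maps onto Halide's compute_root/compute_at scheduling choice; a minimal sketch (illustrative pipeline, not from the paper):

        #include "Halide.h"
        using namespace Halide;

        int main() {
            ImageParam in(Float(32), 2);
            Func producer("producer"), consumer("consumer");
            Var x("x"), y("y"), yo, yi;

            producer(x, y) = in(x, y) * 2.f;
            consumer(x, y) = producer(x, y) + producer(x, y + 1);

            consumer.split(y, yo, yi, 64).parallel(yo).vectorize(x, 8);

            // Option A: materialize the whole producer first (no recomputation,
            // but producer values are long evicted when the consumer reads them):
            //   producer.compute_root();
            // Option B: interleave the producer per strip of the consumer, trading
            // slight recomputation at strip borders for producer-consumer locality:
            producer.compute_at(consumer, yo);

            consumer.compile_jit();
            return 0;
        }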

    Automatic memory-efficient scheduling of CNNs

    Accessing large external DRAM is costly and poses a challenge to the efficient evaluation of data-intensive convolutional neural networks (CNNs) on embedded devices. These external memory accesses can be minimized by exploiting data reuse in on-chip memory. Selecting the combination of code transformations that minimizes the external DRAM accesses is, however, an extremely complex task. In this work, a mathematical model is presented to quickly and precisely evaluate combinations of code transformations on CNNs. An accompanying open-source tool is developed which leverages this model to perform automated design-space exploration and code generation for CNNs. The correctness of the developed model is demonstrated by measurements on seven neural networks. Results show that the transformations selected by the tool can reduce external memory accesses by over an order of magnitude.
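
    For intuition about why the tiling choice matters so much, the sketch below uses a deliberately simple external-access count for one tiled convolution layer (an assumed back-of-the-envelope model, not the paper's) to compare two tilings of the same layer.

        #include <cstdio>

        // Illustrative model: every Th x Tw x Tk output tile loads its input
        // halo and its weights from DRAM, and each output element is written once.
        long long ext_accesses(long long H, long long W, long long K,  // output dims
                               long long C, long long R, long long S,  // in-ch, kernel
                               long long Th, long long Tw, long long Tk) {
            long long tiles = ((H + Th - 1) / Th) * ((W + Tw - 1) / Tw)
                            * ((K + Tk - 1) / Tk);
            long long in_per_tile  = (Th + R - 1) * (Tw + S - 1) * C;  // input + halo
            long long w_per_tile   = R * S * C * Tk;                   // weights
            long long out_per_tile = Th * Tw * Tk;                     // results
            return tiles * (in_per_tile + w_per_tile + out_per_tile);
        }

        int main() {
            // Same layer, two tilings: the larger tiles reuse inputs and weights
            // on chip, needing about 10x fewer external accesses (~30.3M vs ~2.9M),
            // at the price of a larger on-chip buffer.
            std::printf("%lld\n", ext_accesses(56, 56, 256, 128, 3, 3, 7, 7, 16));
            std::printf("%lld\n", ext_accesses(56, 56, 256, 128, 3, 3, 28, 28, 128));
            return 0;
        }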